27 Correlation
27.1 Karl Pearson’s Coefficient of Correlation
Karl Pearson’s correlation coefficient, also known as the Pearson product-moment correlation coefficient (PPMCC) or simply Pearson’s correlation, is a measure used in statistics to determine the degree of linear relationship between two variables. It’s widely used in the sciences to quantify the linear correlation between datasets.
27.1.1 Assumptions
Pearson’s correlation requires certain assumptions about the data it is used to analyze:
- Linearity: The relationship between the two variables should be linear.
- Homoscedasticity: The variances along the line of best fit remain similar as the value of the predictor variable increases.
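- Normally Distributed Variables: Both variables being tested should follow an approximately normal distribution. This matters mainly for significance tests on \(r\); the coefficient itself can be computed for any paired numeric data.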
27.1.2 Formula
The Pearson correlation coefficient (\(r\)) is calculated using the following formula:
\[ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
Where:
- \(n\) is the number of paired data points.
- \(x\) and \(y\) are the variables for which the correlation is being calculated.
- \(\sum\) is the summation symbol, taken over all values of \(x\), \(y\), \(xy\), \(x^2\), and \(y^2\).
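The formula above can be sketched directly in Python. This is a minimal illustration, not a library implementation: the function name `pearson_r` and its error handling are assumptions introduced here, and it uses only the standard library.

```python
import math


def pearson_r(x, y):
    """Pearson's r computed with the raw-sums formula shown above."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    if n < 2:
        raise ValueError("at least two paired observations are required")

    # The five sums that appear in the formula
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        # Happens when x or y is constant, so r is undefined
        raise ValueError("r is undefined when either variable has zero variance")
    return numerator / denominator


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear data, r = 1.0
```

The raw-sums form is convenient for hand calculation, but for large values it can lose precision to rounding, which is one reason library routines usually work with deviations from the mean instead.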
27.1.3 Interpretation
The value of \(r\) ranges from -1 to +1:
- +1 indicates a perfect positive linear relationship,
- -1 indicates a perfect negative linear relationship,
- 0 means no linear relationship exists.
Values close to +1 or -1 indicate a strong linear relationship, while values close to 0 indicate a weak one.
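The three reference values can be illustrated with small made-up datasets. The sketch below assumes NumPy is available; the data are hypothetical and chosen only so that the results are easy to verify. The last case shows that \(r \approx 0\) rules out a linear relationship, not every relationship.

```python
import numpy as np

# Hypothetical x values, symmetric about zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

r_positive = np.corrcoef(x, 3 * x + 1)[0, 1]   # points on an increasing line
r_negative = np.corrcoef(x, -3 * x + 1)[0, 1]  # points on a decreasing line
r_curved = np.corrcoef(x, x ** 2)[0, 1]        # symmetric parabola, no linear trend

print("increasing line:", round(r_positive, 6))
print("decreasing line:", round(r_negative, 6))
print("parabola:", round(r_curved, 6))
```

Here \(y = x^2\) depends entirely on \(x\), yet its correlation with \(x\) is zero (up to floating-point rounding) because the relationship is not linear.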
27.1.4 Example Problem
Suppose we want to determine the relationship between hours studied and scores obtained in an exam. Here are the data for 5 students:
- Hours Studied: 2, 4, 6, 8, 10
- Scores: 20, 40, 60, 80, 100
Hypotheses:
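- Null Hypothesis (H₀): There is no linear correlation between hours studied and scores in the population (\(\rho = 0\)).
- Alternative Hypothesis (H₁): There is a linear correlation between hours studied and scores in the population (\(\rho \neq 0\)).
Here \(\rho\) denotes the population correlation, which the sample coefficient \(r\) estimates.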
Calculate Pearson’s r:
Using the data above, first compute the sums and products that the formula requires:
- Sum of Hours Studied (\(\sum x\)): \(2 + 4 + 6 + 8 + 10 = 30\)
- Sum of Scores (\(\sum y\)): \(20 + 40 + 60 + 80 + 100 = 300\)
- Sum of the product of hours and scores (\(\sum xy\)): \(2 \times 20 + 4 \times 40 + 6 \times 60 + 8 \times 80 + 10 \times 100 = 2200\)
- Sum of the squares of hours (\(\sum x^2\)): \(2^2 + 4^2 + 6^2 + 8^2 + 10^2 = 220\)
- Sum of the squares of scores (\(\sum y^2\)): \(20^2 + 40^2 + 60^2 + 80^2 + 100^2 = 22000\)
Plugging these values into the formula with \(n = 5\) gives: \[ r = \frac{5(2200) - (30)(300)}{\sqrt{[5(220) - (30)^2][5(22000) - (300)^2]}} = \frac{2000}{\sqrt{(200)(20000)}} = \frac{2000}{2000} = 1 \]
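As a check on the hand arithmetic, the sums can be recomputed directly from the data. This is a minimal sketch using only the standard library; the variable names are chosen here for illustration.

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [20, 40, 60, 80, 100]
n = len(hours)

# Recompute each sum used in the formula from the raw data
sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(h * s for h, s in zip(hours, scores))
sum_x2 = sum(h ** 2 for h in hours)
sum_y2 = sum(s ** 2 for s in scores)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print("sum x:", sum_x, "sum y:", sum_y)
print("sum xy:", sum_xy, "sum x^2:", sum_x2, "sum y^2:", sum_y2)
print("r:", r)
```

Because every score is exactly ten times the hours studied, the points lie on a straight line and \(r\) comes out as 1.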
Conclusion:
Since \(r = 1\), the sample shows a perfect positive linear relationship between hours studied and scores obtained. This is consistent with the alternative hypothesis, although a formal statement about significance also depends on the sample size (here \(n = 5\)), not on the value of \(r\) alone.
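One common way to test H₀ formally, added here as a supplement and not part of the worked example, is a \(t\) statistic with \(n - 2\) degrees of freedom. It assumes the two variables are approximately bivariate normal:
\[ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \]
In this example \(r = 1\) makes the denominator zero, so \(t\) is unbounded. That reflects the idealised, perfectly collinear data; with real measurements \(|r| < 1\) and the statistic is finite.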
Pearson’s Correlation using Excel:
Download the Excel file link here
Pearson’s Correlation using R:
Pearson’s Correlation using Python:
import numpy as np
# Data for calculation
hours_studied = np.array([2, 4, 6, 8, 10])
scores = np.array([20, 40, 60, 80, 100])
# Perform Pearson's correlation: np.corrcoef returns the 2x2 correlation
# matrix, and the off-diagonal entry [0, 1] is r between the two arrays
correlation_coefficient = np.corrcoef(hours_studied, scores)[0, 1]
# Print the results
print("Pearson's correlation coefficient:", correlation_coefficient)
Pearson's correlation coefficient: 1.0